Introduction

This R Markdown document shows how to interact with the ‘Danmark set fra Luften’ API using R. R Markdown is a format that combines explanatory text with code, which makes it possible to describe the decisions made during data processing thoroughly.
For a further description of the R Markdown format see R for Data Science, Chapter 27: R Markdown.

The primary objective of this R Markdown document is to introduce the XML format. This is done by demonstrating how to load the aerial photo data into R through an API. In this particular case we have chosen to have the API return the data in the XML (Extensible Markup Language) format. Although the API can return other formats, we chose XML since CSV and JSON have already been introduced in the previous lessons.

XML is a common way of exchanging data over the Web that stores data in a hierarchical structure.[^1] A good argument for learning XML is that many standard formats are based on it.[^2] One of these is the Text Encoding Initiative (TEI), a standard for representing digital texts, for instance a digital representation of a theatre play. In that case stage directions are marked as such, and lines are marked up along with other relevant information about the role, such as name, age and gender. Another XML-based standard, and the focal point of this lesson, is Keyhole Markup Language (KML), which is used for storing geographical data. But before venturing into KML data it is a good idea to get familiar with a simple XML example:

<?xml version="1.0" encoding="UTF-8"?>
<artists xmlns="https://chc.au.dk">
    <artist id="1">
        <name>Christoffer Wilhelm</name>
        <surname>Eckersberg</surname>
    </artist>
    <artist id="2">
        <name>Wilhelm</name>
        <surname>Marstrand</surname>
        <movement>Romanticism</movement>
    </artist>
</artists>

As mentioned before, XML (and thus KML) is hierarchical. In the example above the top level is “artists”. Beneath this level is the “artist” level, which holds the information about a given artist. In this example there are two artists. The information about each artist is “name” and “surname”, and for the artist with id 2 we also have the artistic movement. There could be more information, such as “birthday” or “education”. The main point is that “name” belongs under “artist”, which belongs under “artists”. These relationships can be described as parents and children. So “artist” is a child of “artists”, and “artists” thus has two children: <artist id="1"> and <artist id="2">. These likewise have children: <name>, <surname> and, for the second artist, <movement>.

This relationship can also be imagined as a tree with a root and branches:

The simple XML example imagined as a tree. Tree silhouette based on a photo by Mila Tovar on Unsplash.

For further information and explanation of XML see page 42 in Munzert, Simon, Christian Rubba, Peter Meißner, and Dominic Nyhuis. Automated Data Collection with R: a Practical Guide to Web Scraping and Text Mining. New York: WILEY, 2014.

The hierarchical structure of the mark-up language is not very helpful in R, where dataframes are preferred. In other words, the data is wanted like this in R:

artists dataframe:

person_id  name                 surname     movement
1          Christoffer Wilhelm  Eckersberg  NA
2          Wilhelm              Marstrand   Romanticism


This little example shows the fundamental task of this lesson. This brief introduction to XML is by no means exhaustive, but it will be elaborated as the lesson progresses. One important thing to bear in mind is that this lesson does not try to be a comprehensive description of XML; it aims to provide enough knowledge to wrangle XML into a dataframe structure.

Before diving into the interaction with the API, the first challenge of this lesson is to parse the previous simple example into an R dataframe. As usual in R, the first step is to load the relevant libraries.

Loading libraries

R works with libraries (packages) that add functionality to the base R functions. In this case, the relevant packages are:

library(xml2)
library(tidyverse)

For further information on these libraries see:
https://www.tidyverse.org/packages/
https://xml2.r-lib.org

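As an initial step the simple example from before is loaded into R. Note one small change: an empty movement element has been added to the first artist, so that both artists have the same set of children: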

artists <- read_xml('<?xml version="1.0" encoding="UTF-8"?>
<artists xmlns="https://chc.au.dk">
    <artist id="1">
        <name>Christoffer Wilhelm</name>
        <surname>Eckersberg</surname>
         <movement></movement>
    </artist>
    <artist id="2">
        <name>Wilhelm</name>
        <surname>Marstrand</surname>
        <movement>Romanticism</movement>
    </artist>
</artists>')
artists
## {xml_document}
## <artists xmlns="https://chc.au.dk">
## [1] <artist id="1">\n  <name>Christoffer Wilhelm</name>\n  <surname>Eckersber ...
## [2] <artist id="2">\n  <name>Wilhelm</name>\n  <surname>Marstrand</surname>\n ...

When exploring this XML document in R we notice that the first line, in curly brackets, describes what type of document is at hand. Not surprisingly it says “xml_document”. The next line is the root of our XML document, followed by its two children.
Before trying to extract the names and surnames of the artists it is necessary to dwell on the root:

The important thing here is the XML namespace (xmlns), which has to do with the fact that XML is used in many different formats for different purposes. Sometimes these formats are intertwined; a simple example would be the attribute “id”. It is easy to imagine two formats that both contain an id attribute, and in this situation it is important to be able to tell the two apart. This is where the namespace comes in. For more information on XML namespaces visit the W3 Consortium’s recommendation.

The xml2 package has a function for seeing namespaces in a given xml document:

artists %>% 
  xml_ns()
## d1 <-> https://chc.au.dk

This means that in the current document the namespace has been given the prefix d1. This will be important in a moment.
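The object returned by xml_ns() can also be inspected in code. The sketch below assumes a shortened, hand-written copy of the artists document (artists_short is a name used only for this illustration). It shows that the result behaves like a named character vector, where the names are the prefixes assigned by xml2 and the values are the namespace URIs.

```r
library(xml2)

# Shortened copy of the artists example (one artist is enough here)
artists_short <- read_xml('<artists xmlns="https://chc.au.dk">
  <artist id="1">
    <name>Christoffer Wilhelm</name>
    <surname>Eckersberg</surname>
  </artist>
</artists>')

# Names are the prefixes, values are the namespace URIs
ns <- xml_ns(artists_short)
names(ns)
ns[["d1"]]
```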

Taking a step back, the challenge is to use R to select the relevant elements and piece them together again as a dataframe. But how are single elements of the XML targeted? The answer is XPath (XML Path Language) in combination with the namespace. XPath is used to define the path through the XML hierarchy to a given piece of information. In other words, XPath is a finger used for pointing at elements of interest. For more information see the W3 Consortium’s recommendation on XPath.

In this example the XPath finger points at the two surnames in our artists XML document. Notice how the namespace prefix “d1” is prepended to the element name (surname). Thus the two surname elements are extracted:

artists %>% 
  xml_find_all("//d1:surname")
## {xml_nodeset (2)}
## [1] <surname>Eckersberg</surname>
## [2] <surname>Marstrand</surname>
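The double slash in the XPath above means "search at any depth". As a small sketch, assuming a shortened hand-written copy of the artists document (artists_short is an illustrative name), the same two elements can also be reached by spelling out the full path from the root. The namespaces are passed explicitly here, which is optional since xml_find_all() looks them up by default.

```r
library(xml2)
library(magrittr)

artists_short <- read_xml('<artists xmlns="https://chc.au.dk">
  <artist id="1"><surname>Eckersberg</surname></artist>
  <artist id="2"><surname>Marstrand</surname></artist>
</artists>')
ns <- xml_ns(artists_short)

# "//" matches surname elements at any depth of the document
anywhere <- artists_short %>%
  xml_find_all("//d1:surname", ns)

# A leading "/" spells out every step from the root downwards
full_path <- artists_short %>%
  xml_find_all("/d1:artists/d1:artist/d1:surname", ns)

# Both expressions point at the same two elements
length(anywhere)
length(full_path)
```

The short form is convenient in a small document; the full path is more explicit about where in the hierarchy the element is expected.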

The two surnames are still XML elements in the form of an XML nodeset. The next step is to extract the surnames as text:

artists %>% 
  xml_find_all("//d1:surname") %>% 
  xml_text()
## [1] "Eckersberg" "Marstrand"

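This way only the actual surnames are extracted, no longer as an XML nodeset but as a character vector. The same approach extracts the id attributes and the movements, and the resulting vectors are then combined into a dataframe with tibble():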

artists %>% 
  xml_find_all("//@id") %>% 
  xml_text()
## [1] "1" "2"
artists %>% 
  xml_find_all("//d1:movement") %>% 
  xml_text()
## [1] ""            "Romanticism"
artists_df <- tibble(id = artists %>%  xml_find_all("//@id") %>% xml_integer(),
                     name = artists %>% xml_find_all("//d1:name") %>% xml_text(),
                     surname = artists %>% xml_find_all("//d1:surname") %>% xml_text(),
                     movement = artists %>%  xml_find_all("//d1:movement") %>% xml_text())

artists_df
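Note that the empty movement element of the first artist is read as an empty string rather than as NA, whereas the target table in the introduction shows NA. One possible way to close that gap, sketched here on a shortened hand-written copy of the example, is na_if() from dplyr. The object names artists_short and artists_na are used only for this illustration.

```r
library(xml2)
library(tibble)
library(dplyr)

artists_short <- read_xml('<artists xmlns="https://chc.au.dk">
  <artist id="1"><surname>Eckersberg</surname><movement></movement></artist>
  <artist id="2"><surname>Marstrand</surname><movement>Romanticism</movement></artist>
</artists>')
ns <- xml_ns(artists_short)

artists_na <- tibble(
  surname  = artists_short %>% xml_find_all("//d1:surname", ns) %>% xml_text(),
  movement = artists_short %>% xml_find_all("//d1:movement", ns) %>% xml_text()
) %>%
  # Replace empty strings with NA so that missing values are explicit
  mutate(movement = na_if(movement, ""))

artists_na
```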

Loading the data

The data is delivered through the ‘Danmark set fra Luften’ API, as mentioned before. This is done with a request URL leading to our data. In the previous lessons we created the URL via the APIs’ Swagger interfaces. However, the ‘Danmark set fra Luften’ API does not have a Swagger interface. Instead it has a somewhat minimal interface with four URLs corresponding to the four formats that the API delivers.

Constructing URL

By navigating the map, the four URLs above change. What changes in the URL is the coordinates used to specify which area to request data from. In this case we have defined an area around the old manor house of Gl. Estrup, from which we want all the aerial photos. For further information on constructing the request URL see the API documentation: https://docs.google.com/document/pub?id=16daS_dAe2nrqCiZeOLb3N0GkfMalbJHAwQhMhWVy0UI

The interface gives us this URL:

http://www5.kb.dk/cop/syndication/images/luftfo/2011/maj/luftfoto/subject203?bbo=10.347597599029541,56.438728608888425,10.341149568557741,56.4367119455412&itemsPerPage=500&page=1&format=kml

Let us break down this URL to better understand its composition:

Base URL:

http://www5.kb.dk/cop/syndication/images/luftfo/2011/maj/luftfoto/subject203?

Bounding box (bbo): a way of defining which area you are interested in data from, here defined with coordinates. These four coordinates create a rectangle corresponding to the view of Gl. Estrup above.

bbo=10.347597599029541,56.438728608888425,10.341149568557741,56.4367119455412

Since a bounding box can be as large as you like, there can be a lot of data points (aerial photos with metadata) within it. In order not to crash the system by sending too large a data package, the data is organised in pages. Here we define how many data points we want per page. The maximum is 5000.

&itemsPerPage=500

The current page of the results.

&page=1

The last part is where we specify which format we want the data in. Here we have chosen KML, which is a flavour of XML.

&format=kml
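The parts described above can be glued together again in R. The following is a minimal sketch using only base R. The variable names (base_url, bbo, items_per_page, page, request_url) are chosen for this illustration and are not part of the API. The coordinates are kept as strings so that no decimals are lost.

```r
# The individual parts of the request URL, as described above
base_url <- "http://www5.kb.dk/cop/syndication/images/luftfo/2011/maj/luftfoto/subject203"
bbo <- c("10.347597599029541", "56.438728608888425",
         "10.341149568557741", "56.4367119455412")
items_per_page <- 500
page <- 1

# Join the coordinates with commas and append the parameters to the base URL
request_url <- paste0(
  base_url,
  "?bbo=", paste(bbo, collapse = ","),
  "&itemsPerPage=", items_per_page,
  "&page=", page,
  "&format=kml"
)

request_url
```

With the URL assembled this way, requesting another page or another area only requires changing one value instead of editing the whole URL by hand.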

The next step is to load the data from the entire URL into R. We use the function read_xml for this and store the data in a new object called ‘estrup_xml’:

estrup_xml <- read_xml("http://www5.kb.dk/cop/syndication/images/luftfo/2011/maj/luftfoto/subject203?bbo=10.347597599029541,56.438728608888425,10.341149568557741,56.4367119455412&itemsPerPage=500&page=1&format=kml")

So right now we have our XML document within R. Right beneath {xml_document} the format (kml) is specified along with its XML namespace:

estrup_xml
## {xml_document}
## <kml xmlns="http://www.opengis.net/kml/2.2">
## [1] <Document>\n  <Style id="balloon-style">\n    <BalloonStyle>\n      <text ...

But it is still not in the dataframe form that we want in order to use the analytical powers of R. Before we can start this process we need to inspect the structure of the XML file. Otherwise we won’t be able to identify which elements we are interested in extracting from the XML document to our dataframe. From the introduction we know that XML is hierarchically structured. This structure is commonly referred to as a family tree, which is also the case in the xml2 package functions that we use here. Let’s inspect the root of estrup_xml:

estrup_xml %>% 
  xml_root()
## {xml_document}
## <kml xmlns="http://www.opengis.net/kml/2.2">
## [1] <Document>\n  <Style id="balloon-style">\n    <BalloonStyle>\n      <text ...

This is the same as when we just typed the name ‘estrup_xml’. Not surprisingly, typing estrup_xml on its own gives us the root. Instead let’s take a look at the root’s children:

estrup_xml %>% 
  xml_children()
## {xml_nodeset (1)}
## [1] <Document>\n  <Style id="balloon-style">\n    <BalloonStyle>\n      <text ...

So the root has only a single child: Document. We now inspect this child by its number (1) in the xml_child() function:

estrup_xml %>% 
  xml_child(1)
## {xml_node}
## <Document>
##  [1] <Style id="balloon-style">\n  <BalloonStyle>\n    <text><![CDATA[\n      ...
##  [2] <startIndex xmlns="http://a9.com/-/spec/opensearch/1.1/">1</startIndex>
##  [3] <itemsPerPage xmlns="http://a9.com/-/spec/opensearch/1.1/">500</itemsPer ...
##  [4] <Query xmlns="http://a9.com/-/spec/opensearch/1.1/" role="request" searc ...
##  [5] <totalResults xmlns="http://a9.com/-/spec/opensearch/1.1/">22</totalResu ...
##  [6] <link xmlns="http://www.w3.org/2005/Atom" href="http://www.kb.dk/cop/ima ...
##  [7] <link xmlns="http://www.w3.org/2005/Atom" href="http://www.kb.dk/cop/syn ...
##  [8] <Placemark xmlns:atom="http://www.w3.org/2005/Atom" xmlns="http://www.op ...
##  [9] <Placemark xmlns:atom="http://www.w3.org/2005/Atom" xmlns="http://www.op ...
## [10] <Placemark xmlns:atom="http://www.w3.org/2005/Atom" xmlns="http://www.op ...
## [11] <Placemark xmlns:atom="http://www.w3.org/2005/Atom" xmlns="http://www.op ...
## [12] <Placemark xmlns:atom="http://www.w3.org/2005/Atom" xmlns="http://www.op ...
## [13] <Placemark xmlns:atom="http://www.w3.org/2005/Atom" xmlns="http://www.op ...
## [14] <Placemark xmlns:atom="http://www.w3.org/2005/Atom" xmlns="http://www.op ...
## [15] <Placemark xmlns:atom="http://www.w3.org/2005/Atom" xmlns="http://www.op ...
## [16] <Placemark xmlns:atom="http://www.w3.org/2005/Atom" xmlns="http://www.op ...
## [17] <Placemark xmlns:atom="http://www.w3.org/2005/Atom" xmlns="http://www.op ...
## [18] <Placemark xmlns:atom="http://www.w3.org/2005/Atom" xmlns="http://www.op ...
## [19] <Placemark xmlns:atom="http://www.w3.org/2005/Atom" xmlns="http://www.op ...
## [20] <Placemark xmlns:atom="http://www.w3.org/2005/Atom" xmlns="http://www.op ...
## ...
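The output above also shows the paging information from the request URL: itemsPerPage is 500 and totalResults is 22, so everything fits on one page. As a hedged sketch, the number of pages could be computed as below. It uses a hand-written miniature of the response rather than the live API, and the XPath function local-name() to match an element regardless of its namespace. The names mini_kml, total_results, items_per_page and n_pages are illustrative.

```r
library(xml2)
library(magrittr)

# Hand-written miniature of the response (an assumption for illustration)
mini_kml <- read_xml('<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <itemsPerPage xmlns="http://a9.com/-/spec/opensearch/1.1/">500</itemsPerPage>
    <totalResults xmlns="http://a9.com/-/spec/opensearch/1.1/">22</totalResults>
  </Document>
</kml>')

# local-name() compares the element name only and ignores the namespace
total_results <- mini_kml %>%
  xml_find_first("//*[local-name()='totalResults']") %>%
  xml_integer()

items_per_page <- mini_kml %>%
  xml_find_first("//*[local-name()='itemsPerPage']") %>%
  xml_integer()

# Number of pages needed to fetch every result
n_pages <- ceiling(total_results / items_per_page)
n_pages
```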

“Document” is parent to a whole lot more children. The interesting ones here are the “Placemark” elements. Let’s examine one of these (number 8):

estrup_xml %>% 
  xml_child(1) %>% 
  xml_child(8)
## {xml_node}
## <Placemark id="object733663" xmlns:atom="http://www.w3.org/2005/Atom" xmlns="http://www.opengis.net/kml/2.2">
## [1] <name>Gl. Estrup</name>
## [2] <styleUrl>#balloon-style</styleUrl>
## [3] <atom:link href="http://www.kb.dk/images/luftfo/2011/maj/luftfoto/object7 ...
## [4] <Point>\n  <coordinates>10.344582401790175,56.43786909151412</coordinates ...
## [5] <ExtendedData>\n  <Data name="subjectLink">\n    <value>http://www.kb.dk/ ...

So it looks like “Placemark” is the child that contains the data points (aerial photos). But Placemark is also the parent of “ExtendedData” (number five). To keep an overview of the situation, comments about which child is which are added from now on:

estrup_xml %>% 
  # The Document-child:
  xml_child(1) %>% 
  # The first Placemark-child:
  xml_child(8) %>% 
  # The ExtendedData-child: 
  xml_child(5)
## {xml_node}
## <ExtendedData>
##  [1] <Data name="subjectLink">\n  <value>http://www.kb.dk/images/luftfo/2011/ ...
##  [2] <Data name="subjectName">\n  <value>Jyllands Herregårdsmuseum.</value>\n ...
##  [3] <Data name="subjectCreatorName">\n  <value/>\n</Data>
##  [4] <Data name="subjectCreationDate">\n  <value>1928-1933</value>\n</Data>
##  [5] <Data name="subjectGenre">\n  <value>Skråfoto</value>\n</Data>
##  [6] <Data name="subjectNote">\n  <value/>\n</Data>
##  [7] <Data name="subjectGeographic">\n  <value>Danmark, Jylland, Auning</valu ...
##  [8] <Data name="subjectImageSrc">\n  <value>http://kb-images.kb.dk/DAMJP2/on ...
##  [9] <Data name="subjectThumbnailSrc">\n  <value>http://kb-images.kb.dk/DAMJP ...
## [10] <Data name="recordCreationDate">\n  <value/>\n</Data>
## [11] <Data name="recordChangeDate">\n  <value/>\n</Data>
## [12] <Data name="correctness">\n  <value>1</value>\n</Data>
## [13] <Data name="interestingness">\n  <value>0</value>\n</Data>

So within ExtendedData we have thirteen types of metadata. Here subjectGeographic (number 7) is targeted:

estrup_xml %>% 
  # The Document-child:
  xml_child(1) %>% 
  # The first Placemark-child:
  xml_child(8) %>% 
  # The ExtendedData-child: 
  xml_child(5) %>% 
  # The subjectGeographic-child
  xml_child(7)
## {xml_node}
## <Data name="subjectGeographic">
## [1] <value>Danmark, Jylland, Auning</value>

subjectGeographic contains only one child:

estrup_xml %>% 
  # The Document-child:
  xml_child(1) %>% 
  # The first Placemark-child:
  xml_child(8) %>% 
  # The ExtendedData-child: 
  xml_child(5) %>% 
  # The subjectGeographic-child
  xml_child(7) %>% 
  # The value-child 
  xml_child(1) 
## {xml_node}
## <value>

But the actual piece of information is still wrapped in an XML tag. It is extracted with the function xml_text():

estrup_xml %>% 
  # The Document-child:
  xml_child(1) %>% 
  # The first Placemark-child:
  xml_child(8) %>% 
  # The ExtendedData-child: 
  xml_child(5) %>% 
  # The subjectGeographic-child
  xml_child(7) %>% 
  # The value-child 
  xml_child(1) %>% 
  # Extracting the text
  xml_text()
## [1] "Danmark, Jylland, Auning"

By changing the numerical value in the fourth xml_child-function to 8 the subjectImageSrc is returned (subjectImageSrc is the eighth element under ExtendedData):

estrup_xml %>% 
  # The Document-child:
  xml_child(1) %>% 
  # The first Placemark-child:
  xml_child(8) %>% 
  # The ExtendedData-child: 
  xml_child(5) %>% 
  # The subjectImageSrc-child
  xml_child(8) %>% 
  # The value-child 
  xml_child(1) %>% 
  # Extracting the text
  xml_text()
## [1] "http://kb-images.kb.dk/DAMJP2/online_master_arkiv_6/non-archival/Maps/FYNLUFTFOTO/Danmark/NordiskLuftfoto/NL64000-64199/NL64174_001/full/full/0/native.jpg"

This is the link to the full-size image of the aerial photograph stored within the first Placemark.
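Selecting children by position works, but it depends on the order of the elements. According to the xml2 documentation, xml_child() also accepts the name of a child instead of its number; in a namespaced document the name needs its prefix. The sketch below tries this on a hand-written miniature of one Placemark (mini_kml and geographic are names assumed for illustration, not the API response).

```r
library(xml2)
library(magrittr)

# Hand-written miniature with a single Placemark (an assumption for illustration)
mini_kml <- read_xml('<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark id="object733663">
      <name>Gl. Estrup</name>
      <ExtendedData>
        <Data name="subjectGeographic"><value>Danmark, Jylland, Auning</value></Data>
      </ExtendedData>
    </Placemark>
  </Document>
</kml>')
ns <- xml_ns(mini_kml)

geographic <- mini_kml %>%
  # Children selected by prefixed name instead of by position
  xml_child("d1:Document", ns) %>%
  xml_child("d1:Placemark", ns) %>%
  xml_child("d1:ExtendedData", ns) %>%
  # All children here are called Data, so position is still used for this step
  xml_child(1) %>%
  # The value-child
  xml_child(1) %>%
  xml_text()

geographic
```

Selecting by name returns the first child with that name, so it is suited to elements that occur once per parent, such as Document or ExtendedData.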

The next step is to use the insights gained in the previous section to extract all the children of ‘Placemark’ and all the children under ‘ExtendedData’ by using XPath.

Parsing the xml file to a dataframe using Xpath

Since the ‘Placemark’ children are some levels below the root of the XML file, the first thing we need to do is extract them.

First let’s examine the first ‘Placemark’ once more:

estrup_xml %>% 
  xml_child(1) %>% 
  xml_child(8)
## {xml_node}
## <Placemark id="object733663" xmlns:atom="http://www.w3.org/2005/Atom" xmlns="http://www.opengis.net/kml/2.2">
## [1] <name>Gl. Estrup</name>
## [2] <styleUrl>#balloon-style</styleUrl>
## [3] <atom:link href="http://www.kb.dk/images/luftfo/2011/maj/luftfoto/object7 ...
## [4] <Point>\n  <coordinates>10.344582401790175,56.43786909151412</coordinates ...
## [5] <ExtendedData>\n  <Data name="subjectLink">\n    <value>http://www.kb.dk/ ...

The challenge is now to extract all the Placemark children instead of examining one at a time. As shown in the artist example, this is done with XPath and the corresponding namespace. The xml_ns() function provides the namespaces:

estrup_xml %>% 
  xml_ns()
## d1     <-> http://www.opengis.net/kml/2.2
## d2     <-> http://a9.com/-/spec/opensearch/1.1/
## d3     <-> http://a9.com/-/spec/opensearch/1.1/
## d4     <-> http://a9.com/-/spec/opensearch/1.1/
## d5     <-> http://a9.com/-/spec/opensearch/1.1/
## d6     <-> http://www.w3.org/2005/Atom
## d7     <-> http://www.w3.org/2005/Atom
## d8     <-> http://www.opengis.net/kml/2.2
## d9     <-> http://www.opengis.net/kml/2.2
## d10    <-> http://www.opengis.net/kml/2.2
## d11    <-> http://www.opengis.net/kml/2.2
## d12    <-> http://www.opengis.net/kml/2.2
## d13    <-> http://www.opengis.net/kml/2.2
## d14    <-> http://www.opengis.net/kml/2.2
## d15    <-> http://www.opengis.net/kml/2.2
## d16    <-> http://www.opengis.net/kml/2.2
## d17    <-> http://www.opengis.net/kml/2.2
## d18    <-> http://www.opengis.net/kml/2.2
## d19    <-> http://www.opengis.net/kml/2.2
## d20    <-> http://www.opengis.net/kml/2.2
## d21    <-> http://www.opengis.net/kml/2.2
## d22    <-> http://www.opengis.net/kml/2.2
## d23    <-> http://www.opengis.net/kml/2.2
## d24    <-> http://www.opengis.net/kml/2.2
## d25    <-> http://www.opengis.net/kml/2.2
## d26    <-> http://www.opengis.net/kml/2.2
## d27    <-> http://www.opengis.net/kml/2.2
## d28    <-> http://www.opengis.net/kml/2.2
## d29    <-> http://www.opengis.net/kml/2.2
## atom   <-> http://www.w3.org/2005/Atom
## atom1  <-> http://www.w3.org/2005/Atom
## atom2  <-> http://www.w3.org/2005/Atom
## atom3  <-> http://www.w3.org/2005/Atom
## atom4  <-> http://www.w3.org/2005/Atom
## atom5  <-> http://www.w3.org/2005/Atom
## atom6  <-> http://www.w3.org/2005/Atom
## atom7  <-> http://www.w3.org/2005/Atom
## atom8  <-> http://www.w3.org/2005/Atom
## atom9  <-> http://www.w3.org/2005/Atom
## atom10 <-> http://www.w3.org/2005/Atom
## atom11 <-> http://www.w3.org/2005/Atom
## atom12 <-> http://www.w3.org/2005/Atom
## atom13 <-> http://www.w3.org/2005/Atom
## atom14 <-> http://www.w3.org/2005/Atom
## atom15 <-> http://www.w3.org/2005/Atom
## atom16 <-> http://www.w3.org/2005/Atom
## atom17 <-> http://www.w3.org/2005/Atom
## atom18 <-> http://www.w3.org/2005/Atom
## atom19 <-> http://www.w3.org/2005/Atom
## atom20 <-> http://www.w3.org/2005/Atom
## atom21 <-> http://www.w3.org/2005/Atom

Apparently the document with the aerial photos contains a lot of namespaces, but further inspection shows that there are several duplicates among the namespaces. Since the document is in the kml-format only the first namespace, d1, will be used:

d1 <-> http://www.opengis.net/kml/2.2
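One way to do the "further inspection" in code is to look at the distinct namespace URIs. This sketch assumes a hand-written miniature document in which each Placemark repeats the namespace declarations, as the Placemarks in the API response do. The names mini_kml and distinct_uris are illustrative.

```r
library(xml2)

# Hand-written miniature; both Placemarks redeclare the same two namespaces
mini_kml <- read_xml('<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark xmlns:atom="http://www.w3.org/2005/Atom" xmlns="http://www.opengis.net/kml/2.2"/>
    <Placemark xmlns:atom="http://www.w3.org/2005/Atom" xmlns="http://www.opengis.net/kml/2.2"/>
  </Document>
</kml>')

# All prefix/URI pairs reported for the document
ns <- xml_ns(mini_kml)

# Only the distinct URIs, regardless of how many prefixes point at them
distinct_uris <- unique(as.character(ns))
distinct_uris
```

Since XPath matches namespaces by URI, any prefix that maps to the KML URI can be used to address KML elements.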

From the previous examination of the document it is known that Placemark is a child of Document. Using this knowledge in combination with the namespace and the xml_find_all() function, all 22 Placemarks are extracted:

estrup_xml %>% 
  xml_find_all("d1:Document/d1:Placemark")
## {xml_nodeset (22)}
##  [1] <Placemark xmlns:atom="http://www.w3.org/2005/Atom" xmlns="http://www.op ...
##  [2] <Placemark xmlns:atom="http://www.w3.org/2005/Atom" xmlns="http://www.op ...
##  [3] <Placemark xmlns:atom="http://www.w3.org/2005/Atom" xmlns="http://www.op ...
##  [4] <Placemark xmlns:atom="http://www.w3.org/2005/Atom" xmlns="http://www.op ...
##  [5] <Placemark xmlns:atom="http://www.w3.org/2005/Atom" xmlns="http://www.op ...
##  [6] <Placemark xmlns:atom="http://www.w3.org/2005/Atom" xmlns="http://www.op ...
##  [7] <Placemark xmlns:atom="http://www.w3.org/2005/Atom" xmlns="http://www.op ...
##  [8] <Placemark xmlns:atom="http://www.w3.org/2005/Atom" xmlns="http://www.op ...
##  [9] <Placemark xmlns:atom="http://www.w3.org/2005/Atom" xmlns="http://www.op ...
## [10] <Placemark xmlns:atom="http://www.w3.org/2005/Atom" xmlns="http://www.op ...
## [11] <Placemark xmlns:atom="http://www.w3.org/2005/Atom" xmlns="http://www.op ...
## [12] <Placemark xmlns:atom="http://www.w3.org/2005/Atom" xmlns="http://www.op ...
## [13] <Placemark xmlns:atom="http://www.w3.org/2005/Atom" xmlns="http://www.op ...
## [14] <Placemark xmlns:atom="http://www.w3.org/2005/Atom" xmlns="http://www.op ...
## [15] <Placemark xmlns:atom="http://www.w3.org/2005/Atom" xmlns="http://www.op ...
## [16] <Placemark xmlns:atom="http://www.w3.org/2005/Atom" xmlns="http://www.op ...
## [17] <Placemark xmlns:atom="http://www.w3.org/2005/Atom" xmlns="http://www.op ...
## [18] <Placemark xmlns:atom="http://www.w3.org/2005/Atom" xmlns="http://www.op ...
## [19] <Placemark xmlns:atom="http://www.w3.org/2005/Atom" xmlns="http://www.op ...
## [20] <Placemark xmlns:atom="http://www.w3.org/2005/Atom" xmlns="http://www.op ...
## ...

Now that the XPath for targeting the 22 Placemarks containing the aerial photos is in place, the next step is to start picking out single data elements from the children of the Placemarks. For example the name (in combination with xml_text() in order to get a vector instead of an xml_nodeset):

estrup_xml %>% 
  xml_find_all("d1:Document/d1:Placemark/d1:name") %>% 
  xml_text()
##  [1] "Gl. Estrup"                                
##  [2] "- 1957 -"                                  
##  [3] "Randersvej 8 - Gammel Estrup - 1936-1987 -"
##  [4] "Gl. Estrup"                                
##  [5] "Gl. Estrup"                                
##  [6] "Gammel Estrup"                             
##  [7] "Gl. Estrup"                                
##  [8] "Randersvej 2 - Gammel Estrup - 1936-1987 -"
##  [9] "Randersvej 6 - Gammel Estrup - 1936-1987 -"
## [10] "- 1961 -"                                  
## [11] "- 1961 -"                                  
## [12] "- 1955 - Gammel Estrup"                    
## [13] "- 1956 -"                                  
## [14] "- 1956 -"                                  
## [15] "- 1956 -"                                  
## [16] "- 1990 -"                                  
## [17] "- 1990 -"                                  
## [18] "- 1990 -"                                  
## [19] "- 1952 - Gammel Estrup"                    
## [20] "- 1952 - Gammel Estrup"                    
## [21] " (1957)\n                    "             
## [22] " (1990)\n                    "
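The last two names in the output above carry a leading space and a trailing line break. xml_text() has a trim argument that removes leading and trailing whitespace. A small sketch on a hand-written miniature document (mini_kml and placemark_names are names assumed for illustration):

```r
library(xml2)
library(magrittr)

# Hand-written miniature; the second name contains surrounding whitespace
mini_kml <- read_xml('<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark><name>Gl. Estrup</name></Placemark>
    <Placemark><name> (1957)
                    </name></Placemark>
  </Document>
</kml>')

placemark_names <- mini_kml %>%
  xml_find_all("d1:Document/d1:Placemark/d1:name") %>%
  # trim = TRUE strips whitespace at both ends of each string
  xml_text(trim = TRUE)

placemark_names
```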

If the desired information is stored within an attribute the situation is somewhat different:

estrup_xml %>% 
  xml_find_all("d1:Document/d1:Placemark")
## {xml_nodeset (22)}
##  [1] <Placemark xmlns:atom="http://www.w3.org/2005/Atom" xmlns="http://www.op ...
##  [2] <Placemark xmlns:atom="http://www.w3.org/2005/Atom" xmlns="http://www.op ...
##  [3] <Placemark xmlns:atom="http://www.w3.org/2005/Atom" xmlns="http://www.op ...
##  [4] <Placemark xmlns:atom="http://www.w3.org/2005/Atom" xmlns="http://www.op ...
##  [5] <Placemark xmlns:atom="http://www.w3.org/2005/Atom" xmlns="http://www.op ...
##  [6] <Placemark xmlns:atom="http://www.w3.org/2005/Atom" xmlns="http://www.op ...
##  [7] <Placemark xmlns:atom="http://www.w3.org/2005/Atom" xmlns="http://www.op ...
##  [8] <Placemark xmlns:atom="http://www.w3.org/2005/Atom" xmlns="http://www.op ...
##  [9] <Placemark xmlns:atom="http://www.w3.org/2005/Atom" xmlns="http://www.op ...
## [10] <Placemark xmlns:atom="http://www.w3.org/2005/Atom" xmlns="http://www.op ...
## [11] <Placemark xmlns:atom="http://www.w3.org/2005/Atom" xmlns="http://www.op ...
## [12] <Placemark xmlns:atom="http://www.w3.org/2005/Atom" xmlns="http://www.op ...
## [13] <Placemark xmlns:atom="http://www.w3.org/2005/Atom" xmlns="http://www.op ...
## [14] <Placemark xmlns:atom="http://www.w3.org/2005/Atom" xmlns="http://www.op ...
## [15] <Placemark xmlns:atom="http://www.w3.org/2005/Atom" xmlns="http://www.op ...
## [16] <Placemark xmlns:atom="http://www.w3.org/2005/Atom" xmlns="http://www.op ...
## [17] <Placemark xmlns:atom="http://www.w3.org/2005/Atom" xmlns="http://www.op ...
## [18] <Placemark xmlns:atom="http://www.w3.org/2005/Atom" xmlns="http://www.op ...
## [19] <Placemark xmlns:atom="http://www.w3.org/2005/Atom" xmlns="http://www.op ...
## [20] <Placemark xmlns:atom="http://www.w3.org/2005/Atom" xmlns="http://www.op ...
## ...

In this case the desired information is the id attribute in the Placemark tag. The way to extract it is similar to before. Instead of xml_text(), the function xml_attr() is used and supplied with the name of the desired attribute:

estrup_xml %>%
  xml_find_all("d1:Document/d1:Placemark") %>% 
  xml_attr("id")
##  [1] "object733663"  "object1945394" "object1014609" "object733664" 
##  [5] "object733666"  "object733742"  "object733665"  "object1014612"
##  [9] "object1014614" "object699812"  "object699811"  "object1870205"
## [13] "object1760676" "object1715024" "object1760687" "object2122693"
## [17] "object2122794" "object2122711" "object1728826" "object1728857"
## [21] "object1945396" "object1519137"
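A useful property of xml_attr() is that it returns one value per node: a node without the requested attribute yields NA instead of being dropped. The sketch below shows this on a hand-written miniature document in which the second Placemark deliberately lacks an id (an assumption for illustration; in the API response above every Placemark has one).

```r
library(xml2)
library(magrittr)

# Hand-written miniature; the second Placemark has no id attribute
mini_kml <- read_xml('<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark id="object733663"><name>Gl. Estrup</name></Placemark>
    <Placemark><name>Gammel Estrup</name></Placemark>
  </Document>
</kml>')

ids <- mini_kml %>%
  xml_find_all("d1:Document/d1:Placemark") %>%
  # One result per Placemark, NA where the attribute is absent
  xml_attr("id")

ids
```

This keeps the vector of ids the same length as the number of Placemarks, which matters when the vectors are later combined into a dataframe.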

Extracting data from the ExtendedData child across all the Placemark tags is somewhat more difficult. ExtendedData has 13 children:

estrup_xml %>% 
  xml_find_first("d1:Document/d1:Placemark/d1:ExtendedData")
## {xml_node}
## <ExtendedData>
##  [1] <Data name="subjectLink">\n  <value>http://www.kb.dk/images/luftfo/2011/ ...
##  [2] <Data name="subjectName">\n  <value>Jyllands Herregårdsmuseum.</value>\n ...
##  [3] <Data name="subjectCreatorName">\n  <value/>\n</Data>
##  [4] <Data name="subjectCreationDate">\n  <value>1928-1933</value>\n</Data>
##  [5] <Data name="subjectGenre">\n  <value>Skråfoto</value>\n</Data>
##  [6] <Data name="subjectNote">\n  <value/>\n</Data>
##  [7] <Data name="subjectGeographic">\n  <value>Danmark, Jylland, Auning</valu ...
##  [8] <Data name="subjectImageSrc">\n  <value>http://kb-images.kb.dk/DAMJP2/on ...
##  [9] <Data name="subjectThumbnailSrc">\n  <value>http://kb-images.kb.dk/DAMJP ...
## [10] <Data name="recordCreationDate">\n  <value/>\n</Data>
## [11] <Data name="recordChangeDate">\n  <value/>\n</Data>
## [12] <Data name="correctness">\n  <value>1</value>\n</Data>
## [13] <Data name="interestingness">\n  <value>0</value>\n</Data>

The problem here is that all these children are Data tags, and the method used so far points at elements by their tag name. Using this method, every Data element from each of the 22 Placemarks is returned (22 × 13 = 286 in total):

estrup_xml %>% 
  xml_find_all("d1:Document/d1:Placemark/d1:ExtendedData/d1:Data")
## {xml_nodeset (286)}
##  [1] <Data name="subjectLink">\n  <value>http://www.kb.dk/images/luftfo/2011/ ...
##  [2] <Data name="subjectName">\n  <value>Jyllands Herregårdsmuseum.</value>\n ...
##  [3] <Data name="subjectCreatorName">\n  <value/>\n</Data>
##  [4] <Data name="subjectCreationDate">\n  <value>1928-1933</value>\n</Data>
##  [5] <Data name="subjectGenre">\n  <value>Skråfoto</value>\n</Data>
##  [6] <Data name="subjectNote">\n  <value/>\n</Data>
##  [7] <Data name="subjectGeographic">\n  <value>Danmark, Jylland, Auning</valu ...
##  [8] <Data name="subjectImageSrc">\n  <value>http://kb-images.kb.dk/DAMJP2/on ...
##  [9] <Data name="subjectThumbnailSrc">\n  <value>http://kb-images.kb.dk/DAMJP ...
## [10] <Data name="recordCreationDate">\n  <value/>\n</Data>
## [11] <Data name="recordChangeDate">\n  <value/>\n</Data>
## [12] <Data name="correctness">\n  <value>1</value>\n</Data>
## [13] <Data name="interestingness">\n  <value>0</value>\n</Data>
## [14] <Data name="subjectLink">\n  <value>http://www.kb.dk/images/luftfo/2011/ ...
## [15] <Data name="subjectName">\n  <value/>\n</Data>
## [16] <Data name="subjectCreatorName">\n  <value>Sylvest Jensen Luftfoto</valu ...
## [17] <Data name="subjectCreationDate">\n  <value>1957</value>\n</Data>
## [18] <Data name="subjectGenre">\n  <value>Skråfoto</value>\n</Data>
## [19] <Data name="subjectNote">\n  <value/>\n</Data>
## [20] <Data name="subjectGeographic">\n  <value>Danmark, Jylland, Auning</valu ...
## ...

What is unique about these Data tags is the value of their name attribute. Thus the solution is to select the Data tags with a given name attribute value, e.g. subjectCreationDate:

estrup_xml %>%  
  xml_find_all('d1:Document/d1:Placemark/d1:ExtendedData/d1:Data[@name="subjectCreationDate"]')
## {xml_nodeset (22)}
##  [1] <Data name="subjectCreationDate">\n  <value>1928-1933</value>\n</Data>
##  [2] <Data name="subjectCreationDate">\n  <value>1957</value>\n</Data>
##  [3] <Data name="subjectCreationDate">\n  <value>1936-1987</value>\n</Data>
##  [4] <Data name="subjectCreationDate">\n  <value>1928-1933</value>\n</Data>
##  [5] <Data name="subjectCreationDate">\n  <value>1928-1933</value>\n</Data>
##  [6] <Data name="subjectCreationDate">\n  <value>1928-1933</value>\n</Data>
##  [7] <Data name="subjectCreationDate">\n  <value>1928-1933</value>\n</Data>
##  [8] <Data name="subjectCreationDate">\n  <value>1936-1987</value>\n</Data>
##  [9] <Data name="subjectCreationDate">\n  <value>1936-1987</value>\n</Data>
## [10] <Data name="subjectCreationDate">\n  <value>1961</value>\n</Data>
## [11] <Data name="subjectCreationDate">\n  <value>1961</value>\n</Data>
## [12] <Data name="subjectCreationDate">\n  <value>1955</value>\n</Data>
## [13] <Data name="subjectCreationDate">\n  <value>1956</value>\n</Data>
## [14] <Data name="subjectCreationDate">\n  <value>1956</value>\n</Data>
## [15] <Data name="subjectCreationDate">\n  <value>1956</value>\n</Data>
## [16] <Data name="subjectCreationDate">\n  <value>1990</value>\n</Data>
## [17] <Data name="subjectCreationDate">\n  <value>1990</value>\n</Data>
## [18] <Data name="subjectCreationDate">\n  <value>1990</value>\n</Data>
## [19] <Data name="subjectCreationDate">\n  <value>1952</value>\n</Data>
## [20] <Data name="subjectCreationDate">\n  <value>1952</value>\n</Data>
## ...

The three previous examples of extracting data using XPath in combination with xml_text() and xml_attr() can easily be incorporated into the tibble() function in order to create a dataframe:

tibble(
  id = estrup_xml %>%
    xml_find_all("d1:Document/d1:Placemark") %>% 
    xml_attr("id"),
  name = estrup_xml %>% 
    xml_find_all("d1:Document/d1:Placemark/d1:name") %>% 
    xml_text(),
  subjectCreationDate = estrup_xml %>%  
    xml_find_all('d1:Document/d1:Placemark/d1:ExtendedData/d1:Data[@name="subjectCreationDate"]') %>% 
    xml_text()
  )
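Building each column with its own xml_find_all() call works when every Placemark contains every element, since the columns then have the same length and order. If an element were missing from one Placemark, the columns would no longer line up. A more defensive sketch, assuming a hand-written miniature document where the second Placemark has no name, first selects the Placemarks and then uses xml_find_first() relative to each of them. xml_find_first() returns one result per input node, which becomes NA where nothing is found. The names mini_kml, placemarks and placemark_df are illustrative.

```r
library(xml2)
library(tibble)
library(magrittr)

# Hand-written miniature; the second Placemark deliberately has no name
mini_kml <- read_xml('<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark id="object733663">
      <name>Gl. Estrup</name>
      <ExtendedData>
        <Data name="subjectCreationDate"><value>1928-1933</value></Data>
      </ExtendedData>
    </Placemark>
    <Placemark id="object1945394">
      <ExtendedData>
        <Data name="subjectCreationDate"><value>1957</value></Data>
      </ExtendedData>
    </Placemark>
  </Document>
</kml>')
ns <- xml_ns(mini_kml)

# Select the Placemarks once, then look inside each of them
placemarks <- mini_kml %>%
  xml_find_all("d1:Document/d1:Placemark", ns)

placemark_df <- tibble(
  id = placemarks %>% xml_attr("id"),
  name = placemarks %>%
    xml_find_first("d1:name", ns) %>%
    xml_text(),
  subjectCreationDate = placemarks %>%
    xml_find_first('d1:ExtendedData/d1:Data[@name="subjectCreationDate"]/d1:value', ns) %>%
    xml_text()
)

placemark_df
```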

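Using the knowledge from this lesson it is possible to create a larger dataframe that holds all the data from the Placemarks. Note that the code below uses the prefix d8 instead of d1; in the xml_ns() output above both prefixes map to the same KML namespace URI, so they select the same elements: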

tibble(
  id = estrup_xml %>%  xml_find_all("d8:Document/d8:Placemark") %>% xml_attr("id"),
  name = estrup_xml %>% xml_find_all("d8:Document/d8:Placemark/d8:name") %>% xml_text(),
  link = estrup_xml %>% xml_find_all("d8:Document/d8:Placemark/atom:link") %>%  xml_attr("href"),
  coordinates = estrup_xml %>% xml_find_all("d8:Document/d8:Placemark/d8:Point/d8:coordinates") %>%  xml_text(),
  subjectLink = estrup_xml %>% xml_find_all(
    "d8:Document/d8:Placemark/d8:ExtendedData/d8:Data[@name='subjectLink']"
  ) %>% xml_text(),
  subjectName = estrup_xml %>% xml_find_all(
    "d8:Document/d8:Placemark/d8:ExtendedData/d8:Data[@name='subjectName']"
  ) %>% xml_text(),
  subjectCreatorName = estrup_xml %>% xml_find_all(
    "d8:Document/d8:Placemark/d8:ExtendedData/d8:Data[@name='subjectCreatorName']"
  ) %>% xml_text(),
  subjectCreationDate = estrup_xml %>% xml_find_all(
    "d8:Document/d8:Placemark/d8:ExtendedData/d8:Data[@name='subjectCreationDate']"
  ) %>% xml_text(),
  subjectGenre = estrup_xml %>% xml_find_all(
    "d8:Document/d8:Placemark/d8:ExtendedData/d8:Data[@name='subjectGenre']"
  ) %>% xml_text(),
  subjectNote = estrup_xml %>% xml_find_all(
    "d8:Document/d8:Placemark/d8:ExtendedData/d8:Data[@name='subjectNote']"
  ) %>% xml_text(),
  subjectGeographic = estrup_xml %>% xml_find_all(
    "d8:Document/d8:Placemark/d8:ExtendedData/d8:Data[@name='subjectGeographic']"
  ) %>% xml_text(),
  subjectImageSrc = estrup_xml %>% xml_find_all(
    "d8:Document/d8:Placemark/d8:ExtendedData/d8:Data[@name='subjectImageSrc']"
  ) %>% xml_text(),
  subjectThumbnailSrc = estrup_xml %>% xml_find_all(
    "d8:Document/d8:Placemark/d8:ExtendedData/d8:Data[@name='subjectThumbnailSrc']"
  ) %>% xml_text(),
  recordCreationDate = estrup_xml %>% xml_find_all(
    "d8:Document/d8:Placemark/d8:ExtendedData/d8:Data[@name='recordCreationDate']"
  ) %>% xml_text(),
  recordChangeDate = estrup_xml %>% xml_find_all(
    "d8:Document/d8:Placemark/d8:ExtendedData/d8:Data[@name='recordChangeDate']"
  ) %>% xml_text(),
  correctness = estrup_xml %>% xml_find_all(
    "d8:Document/d8:Placemark/d8:ExtendedData/d8:Data[@name='correctness']"
  ) %>% xml_text(),
  interestingness = estrup_xml %>% xml_find_all(
    "d8:Document/d8:Placemark/d8:ExtendedData/d8:Data[@name='interestingness']"
  ) %>% xml_text()
)
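Since the thirteen ExtendedData columns differ only in the value of the name attribute, the repetition can be reduced with a small helper. The following is a sketch on a hand-written miniature document, not a drop-in replacement: extract_field(), fields, columns and extended_df are hypothetical names introduced here, and only two of the thirteen fields are included to keep the example short.

```r
library(xml2)
library(tibble)

# Hand-written miniature with two Placemarks (an assumption for illustration)
mini_kml <- read_xml('<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark id="object733663">
      <ExtendedData>
        <Data name="subjectCreationDate"><value>1928-1933</value></Data>
        <Data name="subjectGeographic"><value>Danmark, Jylland, Auning</value></Data>
      </ExtendedData>
    </Placemark>
    <Placemark id="object1945394">
      <ExtendedData>
        <Data name="subjectCreationDate"><value>1957</value></Data>
        <Data name="subjectGeographic"><value>Danmark, Jylland, Auning</value></Data>
      </ExtendedData>
    </Placemark>
  </Document>
</kml>')
ns <- xml_ns(mini_kml)

# Hypothetical helper: text of one ExtendedData field across all Placemarks
extract_field <- function(doc, field, ns) {
  xpath <- sprintf(
    "d1:Document/d1:Placemark/d1:ExtendedData/d1:Data[@name='%s']",
    field
  )
  xml_text(xml_find_all(doc, xpath, ns))
}

fields <- c("subjectCreationDate", "subjectGeographic")

# One character vector per field, named after the field
columns <- lapply(fields, function(field) extract_field(mini_kml, field, ns))
names(columns) <- fields

extended_df <- as_tibble(columns)
extended_df
```

Adding another field then only means adding its name to fields, at the cost of the XPath being assembled in one place rather than written out per column.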